Objective, Questions for the Panel & Schedule


*  Objective:  identify  &  characterize  factors  that  affect  the  impact 
of  Moore’s  Law  on  embedded  applications 


•  Questions  for  the  panel 

-  1).  Moore’s  Law:  what’s  causing  the  slowdown? 

-  2).  What  is  the  contribution  of  Moore’s  Law  to  improvements  at 
the  embedded  system  level? 

-  3).  Can  we  preserve  historical  improvement  rates  for  embedded 
applications? 


Panel  members  &  audience  may  hold  diverse,  evolving  opinions 


•  Schedule 

- 1540-1600: panel introduction & overview

- 1600-1620: guest speaker Dr. Robert Schaller

- 1620-1650: panelist presentations

- 1650-1720: open forum

- 1720-1730: conclusions & the way ahead
Panel Session:

Amending  Moore’s  Law  for  Embedded  Applications 


Moderator:  Dr.  James  C.  Anderson, 
MIT  Lincoln  Laboratory 


Dr.  Richard  Linderman, 

Air  Force  Research  Laboratory 

Dr.  Mark  Richards, 
Georgia  Institute  of  Technology 


Mr.  David  Martinez, 

MIT  Lincoln  Laboratory 


Dr.  Robert  R.  Schaller, 
College  of  Southern  Maryland 
Four Decades of Progress at the System Level

[Timeline figure; recoverable entries in order of appearance]

- Gordon Moore publishes “Cramming more components onto integrated circuits”
- Computers lose badly at chess
- Robert Schaller publishes “Moore’s Law: past, present and future”
- Deep Blue (1270kg) beats chess champ Kasparov
- (with Gary Shaw) “Sustaining the growth of embedded digital processing capability”
- Chess champ Kramnik ties Deep [...]
- [...] ties Deep Junior (10K lines C++ running on 15 GIPS server using 3 Gbytes)
Computation Density (GIPS/Liter)

Power per Unit Volume (Watts/Liter) for Representative Systems ca. 2003

System-level Improvements Falling Short of Historical Moore’s Law

Timeline for ADC Sampling Rate & COTS Processors (2Q04)
Representative Embedded Computing Applications

- Sonar for anti-submarine rocket-launched lightweight torpedo (high throughput requirements but low data rates)

- Radio for soldier’s software-defined comm/nav system (severe size, weight & power constraints)

- Radar for mini-UAV (~3m wingspan) surveillance applications (stressing I/O data rates)

Cost- & schedule-sensitive real-time applications with high RAS (reliability, availability & serviceability) requirements
Embedded Signal Processor Speed & Numeric Representations Must Track ADC Improvements

[Figure: Effective Number of Bits vs. Sampling Rate (MSPS), with regions labeled by typical processor numeric representation]

- 48-64 bit floating-point
- 32 bit floating-point
- 32 bit floating- or fixed-point*
- 16-32 bit fixed-point

Sonar example near limit of 32 bit floating-point (18 ADC bits KSPS + 5 bits processing gain vs. 23 bit mantissa + sign bit)

*Floating-point preferred (same memory & I/O as fixed-point)
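
The sonar bit budget above can be checked with a short calculation. This is a minimal sketch, assuming the processing gain comes from coherent integration over an N-point FFT (10*log10(N) dB at about 6.02 dB per bit, i.e. 0.5*log2(N) bits), which reproduces the 5 bits quoted for a 1024-point transform. The function names are illustrative and do not come from the slides.

```python
import math

# Explicitly stored fraction bits in IEEE 754 single precision (the sign bit is separate).
FLOAT32_MANTISSA_BITS = 23


def processing_gain_bits(fft_points: int) -> float:
    """Bits of SNR gained by coherently integrating fft_points samples.

    Assumption: gain is 10*log10(N) dB and one bit is worth 20*log10(2) dB,
    which simplifies to 0.5 * log2(N).
    """
    return 0.5 * math.log2(fft_points)


def required_bits(adc_bits: int, fft_points: int) -> float:
    """ADC resolution plus processing gain that the number format must carry."""
    return adc_bits + processing_gain_bits(fft_points)


needed = required_bits(adc_bits=18, fft_points=1024)
headroom = FLOAT32_MANTISSA_BITS - needed
print(f"needed: {needed:.0f} bits, mantissa: {FLOAT32_MANTISSA_BITS} bits, headroom: {headroom:.0f}")
```

Under these assumptions the 18-bit sonar ADC leaves no spare mantissa bits, which is why the slide places it at the limit of 32 bit floating-point.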
Conclusions & The Way Ahead


• Slowdown in Moore’s Law due to a variety of factors

- Improvement rate was 4X in 3 yrs, now 2-3X in 3 yrs (still substantial)

- Impact of slowdown greatest in “leading edge” embedded applications

- Software issues may overshadow Moore’s Law slowdown

• COTS markets may not emerge in time to support historical levels of improvement

- Federal government support may be required in certain areas (e.g., ADCs)

- Possible return of emphasis on advanced packaging and custom devices/technologies for military embedded applications

• Developers need to overcome issues with I/O standards & provide customers with cost-effective solutions in a timely manner: success may depend more on economic & political than on technical considerations

• Hardware can be designed to drive down software cost/schedule, but new methodologies face barriers to acceptance

• Improvements clearly come both from Moore’s Law & algorithms, but better metrics needed to measure relative contributions


“It’s  absolutely  critical  for  the  federal  government  to 
fund  basic  research.  Moore’s  Law  will  take  care  of  itself. 
But  what  happens  after  that  is  what  I’m  worried  about.” 

-  Gordon  Moore,  Nov.  2001 
Points of Reference


• 6U form factor COTS (commercial off-the-shelf) multiprocessor card

- Historical data available for many systems

- Convection cooled
  Fans blow air across heat sinks
  Rugged version uses conduction cooling

- Size: 16x23cm, 2cm slot-to-slot (0.76L)

- Weight: 0.6kg, typ.

- Power: 54W max. (71W/L)
  Power limitations on connectors & backplane
  Reliability decreases with increasing temperature

- Can re-package with batteries for hand-held applications (e.g., walkie-talkie similar to 1L water bottle weighing 1kg)


• 1024-point complex FFT (fast Fourier transform)

- Historical data available for many computers (e.g., fftw.org)

- Realistic benchmark that exercises connections between processor, memory and system I/O

- Up to 5 bits processing gain for extracting signals from noise

- Expect 1 µsec/FFT (32 bit floating-point) on 6U COTS card ~7/05
  Assume each FFT computation requires 51,200 real operations
  51.2 GFLOPS (billions of floating point operations/sec) throughput
  1024 MSPS (million samples/sec, complex) sustained, simultaneous input & output (8 Gbytes/sec I&O)
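
The FFT reference figures follow from a few lines of arithmetic. The sketch below assumes the common 5*N*log2(N) operation count for a complex FFT (which yields the 51,200 operations stated above) and 8 bytes per complex sample (two 32 bit floats). Reading "8 Gbytes/sec" as a per-direction rate is also an assumption.

```python
import math

FFT_POINTS = 1024
FFTS_PER_SECOND = 1_000_000      # one transform every microsecond
BYTES_PER_COMPLEX_SAMPLE = 8     # assumed: 32 bit real + 32 bit imaginary

# Conventional operation count for a complex FFT: 5 * N * log2(N) real operations.
ops_per_fft = 5 * FFT_POINTS * int(math.log2(FFT_POINTS))

gflops = ops_per_fft * FFTS_PER_SECOND / 1e9
msps = FFT_POINTS * FFTS_PER_SECOND / 1e6
gbytes_per_sec_each_way = msps * 1e6 * BYTES_PER_COMPLEX_SAMPLE / 1e9

print(f"{ops_per_fft} ops/FFT, {gflops:.1f} GFLOPS, "
      f"{msps:.0f} MSPS, {gbytes_per_sec_each_way:.3f} Gbytes/sec each way")
```

The result is roughly 8.2 Gbytes/sec in and the same out, consistent with the rounded figure on the slide.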
Moore’s Law & Variations, 1965-1997

• “Original” Moore’s Law (1965, revised 1975)

- 4X transistors/die every 3 yrs

- Held from late ’70s - late ’90s for DRAM (dynamic random access memory), the most common form of memory used in personal computers

- Improvements from decreasing geometry, “circuit cleverness,” & increasing die size

- Rates of speed increase & power consumption decrease not quantified


• “Amended” Moore’s Law: 1997 National Technology Roadmap for Semiconductors (NTRS97)

- Models provided projections for 1997-2012

- Improvement rates of 1.4X speed @ constant power & 2.8X density (transistors per unit area) every 3 yrs

- For constant power, speed x density gave max 4X performance improvement every 3 yrs

- Incorrectly predicted 560 mm² DRAM die size for 2003 (4X actual)


Historically,
Performance = 2^(Years/1.5)
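
The historical relation is the 4X-every-3-yrs rate restated as a doubling time of 1.5 yrs. A minimal sketch of that conversion follows; the helper names are illustrative, and the 3X and 2X rows correspond to the slower "2-3X in 3 yrs" rates quoted in the conclusions.

```python
import math


def doubling_time_years(factor: float, period_years: float) -> float:
    """Years needed to double, given a growth factor achieved over period_years."""
    return period_years / math.log2(factor)


def performance(years: float, doubling_years: float = 1.5) -> float:
    """Relative performance after `years`, normalized to 1.0 at year 0."""
    return 2.0 ** (years / doubling_years)


for factor in (4.0, 3.0, 2.0):
    t = doubling_time_years(factor, period_years=3.0)
    print(f"{factor:.0f}X in 3 yrs -> doubling every {t:.2f} yrs")

print(f"performance after 3 yrs at the historical rate: {performance(3.0):.1f}X")
```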


Moore’s Law Slowdown, 1999-2003
(recent experience with synchronous DRAM)


• Availability issues: production did not come until 4 yrs after development for 1Gbit DDR (double data rate) SDRAMs (7/99 - 7/03)

• SDRAM price crash

- 73X reduction in 2.7 yrs (11/99 - 6/02)

- Justice Dept. price-fixing investigation began in 2002

• Reduced demand

- Users unable to take advantage of improvements as $3 SDRAM chip holds 1M lines of code having $100M development cost (6/02)

- Software issues made Moore’s Law seem irrelevant
  Moore’s Law impacted HW, not SW
  Old SW development methods unable to keep pace with HW improvements
  SW slowed at a rate faster than HW accelerated
  Fewer projects had HW on critical path
  In 2000, 25% of U.S. commercial SW projects ($67B) canceled outright with no final product
  4 yr NASA SW project canceled (9/02) after 6 yrs (& $273M) for being 5 yrs behind schedule


System-level improvement rates possibly slowed by factors not considered in Moore’s Law “roadmap” models




The  End  of  Moore’s  Law,  2004-20XX 


•  2003  International  Technology  Roadmap  for  Semiconductors  (ITRS03) 

-  Models  provide  projections  for  2003-2018 

- 2003 DRAM size listed as 139 mm² (1/4 the area predicted by NTRS97)

-  Predicts  that  future  DRAM  die  will  be  smaller  than  in  2003 

- Improvement rates of 1.5X speed @ constant power & 2X density every 3 yrs

-  Speed  x  density  gives  max  3X  performance  improvement  every  3  yrs 

-  Limited  by  lithography  improvement  rate  (partially  driven  by  economics) 

•  Future  implications  (DRAMs  &  other  devices) 

-  Diminished  “circuit  cleverness”  for  mature  designs  (chip  &  card  level) 

-  Die  sizes  have  stopped  increasing  (and  in  some  cases  are  decreasing) 

-  Geometry  &  power  still  decreasing,  but  at  a  reduced  rate 

-  Fundamental  limits  (e.g.,  speed  of  light)  may  be  many  (more)  years  away 

Nearest-neighbor  architectures 
3D  structures 

-  Heat  dissipation  issues  becoming  more  expensive  to  address 

-  More  chip  reliability  &  testability  issues 

-  Influence  of  foundry  costs  on  architectures  may  lead  to  fewer  device  types 
in  latest  technology  (e.g.,  only  SDRAMs  and  static  RAM-based  FPGAs) 

Slower  (but  still  substantial)  improvement  rate  predicted,  with  greatest 
impact  on  systems  having  highest  throughput  &  memory  requirements 
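
The two roadmaps can be compared by multiplying their speed and density gains, as the slides do. This is a minimal sketch using only the figures quoted above; the `Roadmap` class is an illustrative container, not something defined by NTRS97 or ITRS03.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Roadmap:
    name: str
    speed_gain: float    # speed at constant power, per 3 yrs
    density_gain: float  # transistors per unit area, per 3 yrs

    @property
    def max_performance_gain(self) -> float:
        """Upper bound on performance gain per 3 yrs: speed x density."""
        return self.speed_gain * self.density_gain


ntrs97 = Roadmap("NTRS97", speed_gain=1.4, density_gain=2.8)
itrs03 = Roadmap("ITRS03", speed_gain=1.5, density_gain=2.0)

for roadmap in (ntrs97, itrs03):
    print(f"{roadmap.name}: max {roadmap.max_performance_gain:.2f}X every 3 yrs")

# DRAM die area for 2003: NTRS97 prediction vs. the value listed in ITRS03.
die_area_ratio = 560 / 139
print(f"predicted/actual 2003 DRAM die area: {die_area_ratio:.2f}X")
```

The products come out near 4X and 3X, and the die-area ratio near 4X, matching the rounded values on the slides.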
High-Performance MPU (microprocessor unit) &
ASIC (application-specific integrated circuit) Trends


Year of production               2004      2007      2010      2013      2016
MPU/ASIC 1/2 pitch, nm             90        65        45        32        22
Transistors/chip                 553M     1106M     2212M     4424M     8848M
Max watts @ volts             158@1.2   189@1.1   218@1.0   251@0.9   288@0.8
Clock freq, MHz                 4,171     9,285    15,079    22,980    39,683
Clock freq, MHz, for 158W       4,171     7,762    10,929    14,465    21,771


• 2003 International Technology Roadmap for Semiconductors

- http://public.itrs.net

- Executive summary tables 1i&j, 4c&d, 6a&b

- Constant 310 mm² die size

• Lithography improvement rate (partially driven by economics) allows 2X transistors/chip every 3 yrs

- 1.5X speed @ constant power

- ~3X throughput for multiple independent ASIC (or FPGA) cores while maintaining constant power dissipation

- ~2X throughput for large-cache MPUs (constant throughput/memory), but power may decrease with careful design
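
The 2X and 1.5X per-generation rates can be read back out of the table. The sketch below copies the transistor counts and constant-power clock frequencies from the table and takes a geometric mean over the four 3-yr steps; the helper names are illustrative.

```python
years = [2004, 2007, 2010, 2013, 2016]
transistors_millions = [553, 1106, 2212, 4424, 8848]
clock_mhz_at_158w = [4171, 7762, 10929, 14465, 21771]


def per_step_ratios(values):
    """Growth factor between each pair of consecutive 3-yr generations."""
    return [later / earlier for earlier, later in zip(values, values[1:])]


def mean_ratio(values):
    """Geometric-mean growth factor per generation across the whole table."""
    steps = len(values) - 1
    return (values[-1] / values[0]) ** (1 / steps)


transistor_gain = mean_ratio(transistors_millions)
speed_gain = mean_ratio(clock_mhz_at_158w)

print("transistor ratios:", per_step_ratios(transistors_millions))
print("constant-power clock ratios:",
      [round(r, 2) for r in per_step_ratios(clock_mhz_at_158w)])
print(f"mean per 3 yrs: {transistor_gain:.2f}X transistors, {speed_gain:.2f}X speed, "
      f"{transistor_gain * speed_gain:.2f}X combined")
```

Individual clock steps vary from about 1.3X to 1.9X, so the 1.5X figure is best read as an average over the 2004-2016 span.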
Bottleneck Issues


*  Bottlenecks  occur  when  interconnection  bandwidth  (e.g., 
processor-to-memory,  bisection  or  system-level  I/O)  is 
inadequate  to  support  the  throughput  for  a  given  application 

*  For  embedded  applications,  I/O  bottlenecks  are  a  greater 
concern  for  general-purpose,  highly  interconnected  back-end 
vs.  special-purpose,  channelized  front-end  processors 


Can  developers  provide  timely,  cost-effective 
solutions  to  bottleneck  problems? 
Processor Bottlenecks at
Device & System Levels


• Device level (ITRS03)

- 2X transistors & 1.5X speed every 3 yrs
  High-performance microprocessor units & ASICs
  Constant power & 310 mm² die size

- 3X throughput every 3 yrs possible if chip is mostly logic gates changing state frequently (independent ASIC or FPGA cores)

- 2X throughput every 3 yrs is limit for microprocessors with large on-chip cache (chip is mostly SRAM & throughput/memory remains constant)

- Possible technical solutions for microprocessors: 3D structures, on-chip controller for external L3 cache

• System level

- 54W budget for hypothetical 6U COTS card computing 32 bit floating-point 1K complex FFT every 1 µsec
  10% (5W) DC-to-DC converter loss
  40% (22W) I/O (7 input & 7 output links @10 Gbits/sec & 1.5W ea., RF coax, 2004)
  50% (27W) processor (51 GFLOPS sustained) & memory (5 Gbytes)

- Possible technical solutions for I/O
  RF coax point-to-point serial links with central crosspoint switch network (PIN diodes or MEMS switches)
  Fiber optic links (may require optical free-space crosspoint switch) & optical chip-to-chip interconnects
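
The system-level budget above can be cross-checked against the link count and the FFT data rate. This is a minimal sketch using the slide's figures; it treats 10 Gbits/sec as a raw link rate and ignores line-coding and protocol overhead, which is an assumption, as is the 8 bytes per complex sample.

```python
CARD_POWER_W = 54.0
budget_shares = {"dc_dc_loss": 0.10, "io": 0.40, "processor_memory": 0.50}
budget_w = {name: CARD_POWER_W * share for name, share in budget_shares.items()}

LINKS_EACH_WAY = 7
LINK_RATE_GBITS = 10.0   # raw rate per link (overhead ignored, an assumption)
LINK_POWER_W = 1.5

# Power drawn by all input and output links together.
io_power_w = 2 * LINKS_EACH_WAY * LINK_POWER_W
# Capacity in one direction, converted from Gbits/sec to Gbytes/sec.
io_capacity_gbytes = LINKS_EACH_WAY * LINK_RATE_GBITS / 8

# 1024 MSPS of complex samples at an assumed 8 bytes each, per direction.
required_gbytes = 1024e6 * 8 / 1e9

print({name: round(watts, 1) for name, watts in budget_w.items()})
print(f"I/O links draw {io_power_w:.0f} W of the {budget_w['io']:.1f} W I/O share")
print(f"capacity {io_capacity_gbytes:.2f} vs. required {required_gbytes:.3f} Gbytes/sec each way")
```

Seven links per direction leave only a small margin over the required rate, which helps explain why I/O takes 40% of the card's power.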
Examples of Hardware vs. Algorithms

• Static RAM-based FPGAs

- 2002: system-level throughput improved substantially vs. 1999

- 2/3 of improvement attributable to new devices, 1/3 to architecture changes

• Chess computers

- 1997: Deep Blue provided 40 trillion operations per second using 600nm custom ASICs (but 250nm was state-of-the-art)

- 2001: Desktop version of Deep Blue using state-of-the-art custom ASICs feasible, but not built

- 2002-2003: improved algorithms provide functional equivalent of Deep Blue using COTS servers instead of custom ASICs

• Speedup provided by FFT & other “fast” algorithms


Contributions of HW vs. algorithms may be difficult to quantify, even when all necessary data are available
Cost vs. Time for Modern HW/SW Development
Process (normalized to a constant funding level)


[Figure: Cost (effort & expenditures, up to 100%) vs. Time (SW release version 0-4), with separate Software and Hardware curves]


Initial operating capability SW has 12% HW utilization, allowing 8X growth over 9 yr lifetime (2X every 3 yrs): HW still programmable @ end-of-life


Frequent SW-only “tech refresh” provides upgraded capabilities for fixed HW in satellites & space probes, ship-based missiles & torpedoes, radars, “software radios,” etc.
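
The utilization headroom quoted above is a simple compounding calculation. The sketch below assumes software demand on the hardware doubles every 3 yrs from a 12% starting point; the function name is illustrative.

```python
def hw_utilization(initial: float, years: float, doubling_years: float = 3.0) -> float:
    """Fraction of fixed hardware capacity used after `years` of SW growth."""
    return initial * 2.0 ** (years / doubling_years)


for year in (0, 3, 6, 9):
    print(f"year {year}: {hw_utilization(0.12, year):.0%} HW utilization")

end_of_life = hw_utilization(0.12, 9)
one_more_refresh = hw_utilization(0.12, 12)
```

Starting at 12% leaves the hardware at 96% after 9 yrs, so it remains programmable at end-of-life; one further doubling would exceed the available capacity.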
Improvement Rates for Highest Performance COTS ADCs, 2Q04

[Figure labels: Sampling Rate (MSPS); Effective Number of Bits]


Evolution of COTS Embedded Multiprocessor Cards, 2Q04

- Special-purpose ASIC cards (~10 FLOPS/byte) improving 3X in 3 yrs

- General-purpose RISC (with on-chip vector processor) cards (~10 FLOPS/byte) improving 2X in 3 yrs

[Figure label: GFLOPS (card-level sustained throughput for 32-bit flt 1K cmplx FFT)]


Timeline for Highest Performance COTS Multiprocessors, 2Q04

[Figure label: 6U form factor cards <55W]


Timeline for COTS Processor I&O Rate and ADC Sampling Rate (2Q04)

[Figure label: Rate (MSPS)]